This page has step-by-step instructions for input modelling in Python or R, with inspiration from Robinson (2007) and Monks (2024). For advice on making your input modelling workflow reproducible and sharing data or scripts with sensitive content, see Input data management.
Data
To build a DES model, you first need data that reflects the system you want to model. In healthcare, this might mean you need to access healthcare records with patient arrival, service and departure times, for example. The quality of your simulation depends directly on the quality of your data. Key considerations include:
Accuracy. Inaccurate data leads to inaccurate simulation results.
Sample size. Small samples can give misleading results if they capture unusual periods, lack variability, or are affected by outliers.
Level of detail. The data must be granular enough for your needs. For example, daily totals may be insufficient if you want to model hourly arrivals (although this may still be possible if the distribution is known - see Input modelling).
Representative. The data should reflect the current system. For instance, data from the COVID-19 period may not represent typical operations.
How is this data used in the model?
Discrete-event simulation (DES) models are stochastic, which means they incorporate random variation, to reflect the inherent variability of real-world systems.
Instead of using fixed times for events (like having a patient arrive exactly every five minutes), DES models pick the timing of events by randomly sampling values from a probability distribution.
The process of selecting the most appropriate statistical distributions to use in your model is called input modelling.
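To make the sampling idea concrete, here is a minimal sketch of how a DES model might generate arrival times by sampling from a probability distribution. The mean inter-arrival time of 5 minutes is an illustrative value, not taken from the data used later on this page.

```python
import numpy as np

# Illustrative sketch: sample inter-arrival times from an exponential
# distribution (the mean of 5 minutes is a made-up value)
rng = np.random.default_rng(seed=42)
inter_arrival_times = rng.exponential(scale=5, size=1000)

# Event times are the cumulative sum of the sampled gaps
arrival_times = np.cumsum(inter_arrival_times)

print(f"Mean inter-arrival time: {inter_arrival_times.mean():.2f} minutes")
```

Each run of the simulation draws different values, so patients no longer arrive at fixed intervals, but with realistic variability around the chosen mean.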
Input modelling
When selecting appropriate distributions, if you only have summary statistics (like the mean), you may need to rely on expert opinion or the general properties of the process you’re modelling. For example:
Arrivals: Random independent arrivals are often modelled with the Poisson distribution, whilst their inter-arrival times are modelled using an exponential distribution (Pishro-Nik (2014)).
Length of stay: Length of stay is commonly right skewed (Lee, Fung, and Fu (2003)), and so will often be modelled with distributions like exponential, gamma, log-normal (for log-transformed length of stay) or Weibull.
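The Poisson/exponential link mentioned above can be checked numerically: if inter-arrival times are exponential with mean 1/rate, then the number of arrivals per unit time follows a Poisson distribution with that rate. A sketch, using an illustrative rate of 4 arrivals per hour:

```python
import numpy as np

# Sketch of the Poisson/exponential link. The rate of 4 arrivals per
# hour is an illustrative value.
rng = np.random.default_rng(seed=1)
rate = 4  # arrivals per hour
n_hours = 10_000

# Simulate exponential gaps, then count arrivals falling in each hour
gaps = rng.exponential(scale=1 / rate, size=rate * n_hours * 3)
arrival_times = np.cumsum(gaps)
counts = np.histogram(arrival_times, bins=np.arange(0, n_hours + 1))[0]

# For a Poisson process, the mean and variance of hourly counts
# should both be close to the rate
print(f"Mean arrivals per hour: {counts.mean():.2f}")
print(f"Variance of arrivals per hour: {counts.var():.2f}")
```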
However, these standard choices may not always be appropriate. If the actual process differs from your assumptions or has unique features, the chosen distribution might not fit well.
Therefore, if you have enough data, it’s best to analyse it directly to select the most suitable distributions. This analysis generally involves two key steps:
Identify possible distributions. This is based on knowledge of the process being modelled, and by inspecting the data using time series plots and histograms.
Fit distributions to your data and compare goodness-of-fit. You can do this using a:
Targeted approach. Just test the distributions from step 1.
Comprehensive approach. Test a wide range of distributions.
Though the comprehensive approach tests lots of different distributions, it’s still important to do step 1 as:
It’s important to be aware of temporal patterns in the data (e.g. spikes in service length every Friday).
You may find distributions which mathematically fit but are contextually inappropriate (e.g. normal distribution for service times, which can’t be negative).
You may find a better fit from complex distributions, even when simpler ones are sufficient.
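The second point above is easy to demonstrate: a normal distribution may fit short service times reasonably well in a statistical sense, while still assigning probability to impossible negative values. A sketch, using an illustrative mean of 10 minutes and standard deviation of 6 minutes:

```python
import numpy as np

# Sketch of why a mathematically good fit can be contextually wrong:
# a normal distribution for short service times (illustrative mean of
# 10 min, sd of 6 min) produces some negative "service times"
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=10, scale=6, size=10_000)

share_negative = (samples < 0).mean()
print(f"Share of sampled 'service times' below zero: {share_negative:.1%}")
```

Around 5% of samples are negative here, which a simulation would have to clip or reject, distorting the distribution.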
We’ll demonstrate steps for input modelling below using synthetic arrival data from the nurse visit simulation (Example conceptual models). In this case, we already know which distributions to use (as we sampled from them to create our synthetic data!). However, the steps still illustrate how you might select distributions in practice with real data.
Step 1. Identify possible distributions
You first need to select which distributions to fit to your data. You should both:
Consider the known properties of the process being modelled (as above), and
Inspect the data by plotting a histogram.
Regarding the known properties, it would be good to consider the exponential distribution for our arrivals, as that is a common choice for random independent arrivals.
To inspect the data, we will create two plots:
| Plot type | What does it show? | Why do we create this plot? |
|---|---|---|
| Time series | Trends, seasonality, and outliers (e.g. spikes or dips over time). | To check for stationarity (i.e. no trends or sudden changes). Stationarity is an assumption of many distributions, and if trends or anomalies do exist, we may need to exclude certain periods or model them separately. The time series can also be useful for spotting outliers and data gaps. |
| Histogram | The shape of the data's distribution. | Helps identify which distributions might fit the data. |
We repeat this for the arrivals and service (nurse consultation) times, so we have created functions to avoid duplicating code.
First, we import the data.
```python
from distfit import distfit
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import scipy.stats as stats

# Import data
data = pd.read_csv(
    "../../data/NHS_synthetic.csv",
    dtype={
        "ARRIVAL_TIME": str,
        "SERVICE_TIME": str,
        "DEPARTURE_TIME": str
    }
)

# Preview data
data.head()
```
|   | ARRIVAL_DATE | ARRIVAL_TIME | SERVICE_DATE | SERVICE_TIME | DEPARTURE_DATE | DEPARTURE_TIME |
|---|---|---|---|---|---|---|
| 0 | 2025-01-01 | 0001 | 2025-01-01 | 0007 | 2025-01-01 | 0012 |
| 1 | 2025-01-01 | 0002 | 2025-01-01 | 0004 | 2025-01-01 | 0007 |
| 2 | 2025-01-01 | 0003 | 2025-01-01 | 0010 | 2025-01-01 | 0030 |
| 3 | 2025-01-01 | 0007 | 2025-01-01 | 0014 | 2025-01-01 | 0022 |
| 4 | 2025-01-01 | 0010 | 2025-01-01 | 0012 | 2025-01-01 | 0031 |
We then calculate the inter-arrival times.
```python
# Combine date/time and convert to datetime
data["arrival_datetime"] = pd.to_datetime(
    data["ARRIVAL_DATE"] + " " + data["ARRIVAL_TIME"].str.zfill(4),
    format="%Y-%m-%d %H%M"
)

# Sort by arrival time and calculate inter-arrival times
data = data.sort_values("arrival_datetime")
data["iat_mins"] = (
    data["arrival_datetime"].diff().dt.total_seconds() / 60
)
```
We also calculate the service times.
```python
# Combine dates with times (specifying format, as the "HHMM" strings
# are otherwise ambiguous to the parser)
data["service_datetime"] = pd.to_datetime(
    data["SERVICE_DATE"] + " " + data["SERVICE_TIME"].str.zfill(4),
    format="%Y-%m-%d %H%M"
)
data["departure_datetime"] = pd.to_datetime(
    data["DEPARTURE_DATE"] + " " + data["DEPARTURE_TIME"].str.zfill(4),
    format="%Y-%m-%d %H%M"
)

# Calculate time difference in minutes
time_delta = data["departure_datetime"] - data["service_datetime"]
data["service_mins"] = time_delta / pd.Timedelta(minutes=1)
```
Time series. For this data, we observe no trends, seasonality or outliers.
```python
def inspect_time_series(series, y_lab):
    """
    Plot time series.

    Parameters
    ----------
    series : pd.Series
        Series containing the time series data (where index is the date).
    y_lab : str
        Y axis label.
    """
    # Label as "Date" and provided y_lab, and convert to dataframe
    df = series.rename_axis("Date").reset_index(name=y_lab)

    # Create plot
    fig = px.line(df, x="Date", y=y_lab)
    fig.update_layout(showlegend=False, width=700, height=400)
    fig.show()
```
```python
# Plot daily arrivals
inspect_time_series(series=data.groupby(by=["ARRIVAL_DATE"]).size(),
                    y_lab="Number of arrivals")

# Calculate mean service length per day, dropping last day (incomplete)
daily_service = data.groupby("SERVICE_DATE")["service_mins"].mean()
daily_service = daily_service.iloc[:-1]

# Plot mean service length each day
inspect_time_series(series=daily_service,
                    y_lab="Mean consultation length (min)")
```
Histogram. For both inter-arrival times and service times, we observe a right skewed distribution. Hence, it would be good to try exponential, gamma and Weibull distributions.
```python
def inspect_histogram(series, x_lab):
    """
    Plot histogram.

    Parameters
    ----------
    series : pd.Series
        Series containing the values to plot as a histogram.
    x_lab : str
        X axis label.
    """
    fig = px.histogram(series)
    fig.update_layout(
        xaxis_title=x_lab, showlegend=False, width=700, height=400
    )
    fig.update_traces(
        hovertemplate=x_lab + ": %{x}<br>Count: %{y}", name=""
    )
    fig.show()
```
```python
# Plot histogram of inter-arrival times
inspect_histogram(series=data["iat_mins"],
                  x_lab="Inter-arrival time (min)")

# Plot histogram of service times
inspect_histogram(series=data["service_mins"],
                  x_lab="Consultation length (min)")
```
Step 2. Fit distributions and compare goodness-of-fit
We will fit distributions and assess goodness-of-fit using the Kolmogorov-Smirnov (KS) Test. This is a common test which is well-suited to continuous distributions. For categorical (or binned) data, consider using chi-squared tests.
The KS test returns a statistic and a p-value.
Statistic: Measures how well the distribution fits your data.
Lower values indicate a better fit.
Ranges from 0 to 1.
P-value: Tells you if the fit could have happened by chance.
Higher p-values suggest the data follow the distribution.
In large datasets, even good fits often have small p-values.
Ranges from 0 to 1.
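Before applying the test to our data below, here is a minimal sketch of `scipy.stats.kstest` on synthetic data (the sample size and mean are illustrative): we fit an exponential distribution to exponential data, so the statistic should be small.

```python
import numpy as np
import scipy.stats as stats

# Sketch of the KS test workflow on synthetic data (parameters are
# illustrative): generate exponential data, fit an exponential
# distribution, then test the fit
rng = np.random.default_rng(seed=7)
sample = rng.exponential(scale=5, size=500)

params = stats.expon.fit(sample)
ks_result = stats.kstest(sample, "expon", args=params)

# A small statistic indicates the fitted distribution matches the data
print(f"KS statistic: {ks_result.statistic:.4f}, p={ks_result.pvalue:.3f}")
```

Note that when the parameters are estimated from the same data being tested (as here, and as in the examples below), the reported p-values are optimistic; the statistic is still useful for comparing candidate distributions against each other.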
scipy.stats is a popular library for fitting and testing statistical distributions. For more convenience, distfit, which is built on scipy, is another popular package which can test multiple distributions simultaneously (or evaluate specific distributions).
We will illustrate how to perform the targeted approach using scipy directly, and the comprehensive approach using distfit - but you could use either for each approach.
Targeted approach
To implement the targeted approach using scipy.stats…
```python
def fit_distributions(data, dists):
    """
    Fit statistical distributions to the provided data and perform
    Kolmogorov-Smirnov tests to assess the goodness of fit.

    Parameters
    ----------
    data : array_like
        The observed data to fit the distributions to.
    dists : list
        List of the distributions in scipy.stats to fit,
        e.g. ["expon", "gamma"].

    Notes
    -----
    A lower test statistic and higher p-value indicate a better fit
    to the data.
    """
    for dist_name in dists:
        # Fit distribution to the data
        dist = getattr(stats, dist_name)
        params = dist.fit(data)

        # Return results from Kolmogorov-Smirnov test
        ks_result = stats.kstest(data, dist_name, args=params)
        print(f"Kolmogorov-Smirnov statistic for {dist_name}: " +
              f"{ks_result.statistic:.4f} (p={ks_result.pvalue:.2e})")


# Fit and run Kolmogorov-Smirnov test on the inter-arrival and service times
distributions = ["expon", "gamma", "weibull_min"]
print("Inter-arrival time")
fit_distributions(data=data["iat_mins"].dropna(), dists=distributions)
print("Service time")
fit_distributions(data=data["service_mins"], dists=distributions)
```
```
Inter-arrival time
Kolmogorov-Smirnov statistic for expon: 0.1155 (p=0.00e+00)
Kolmogorov-Smirnov statistic for gamma: 0.2093 (p=0.00e+00)
Kolmogorov-Smirnov statistic for weibull_min: 0.1355 (p=0.00e+00)
Service time
Kolmogorov-Smirnov statistic for expon: 0.0480 (p=3.08e-264)
Kolmogorov-Smirnov statistic for gamma: 0.1226 (p=0.00e+00)
Kolmogorov-Smirnov statistic for weibull_min: 0.0696 (p=0.00e+00)
```
Unsurprisingly, the best fit for both is the exponential distribution (lowest test statistic).
We can create a version of our histograms from before but with the distributions overlaid, to visually support this.
```python
def inspect_histogram_with_fits(series, x_lab, dist_name):
    """
    Plot histogram with an overlaid fitted distribution.

    Parameters
    ----------
    series : pd.Series
        Series containing the values to plot as a histogram.
    x_lab : str
        X axis label.
    dist_name : str
        Name of the distribution in scipy.stats to fit, e.g. "expon".
    """
    # Plot histogram with probability density normalisation
    fig = px.histogram(series, nbins=30, histnorm="probability density")
    fig.update_layout(
        xaxis_title=x_lab, showlegend=True, width=700, height=400
    )

    # Fit the distribution and overlay its probability density function
    x = np.linspace(series.min(), series.max(), 1000)
    dist = getattr(stats, dist_name)
    params = dist.fit(series.dropna())
    pdf_fitted = dist.pdf(x, *params[:-2], loc=params[-2], scale=params[-1])
    fig.add_trace(go.Scatter(x=x, y=pdf_fitted, mode="lines", name=dist_name))
    fig.show()
```
```python
# Plot histogram with fitted distribution
inspect_histogram_with_fits(series=data["iat_mins"].dropna(),
                            x_lab="Inter-arrival time (min)",
                            dist_name="expon")
inspect_histogram_with_fits(series=data["service_mins"],
                            x_lab="Service time (min)",
                            dist_name="expon")
```
Comprehensive approach
To implement the comprehensive approach using distfit…
```python
# Fit popular distributions to inter-arrival times
dfit_iat = distfit(distr="popular", stats="RSS", verbose="silent")
_ = dfit_iat.fit_transform(data["iat_mins"].dropna())

# Fit popular distributions to service times
dfit_service = distfit(distr="popular", stats="RSS", verbose="silent")
_ = dfit_service.fit_transform(data["service_mins"])
```
We can view a summary table from distfit.
The score column is the result from a goodness-of-fit test. This is set using stats in distfit (e.g. distfit(stats="RSS")). It provides several possible tests including:
RSS - residual sum of squares
wasserstein - Wasserstein distance
ks - Kolmogorov-Smirnov statistic
energy - energy distance
goodness_of_fit - general purpose test from scipy.stats.goodness_of_fit
For continuous data, ks is often a good choice - but note that distfit calculates this statistic in an unusual way: it resamples from the fitted distribution and compares that sample to the original data. Our manual implementation above instead uses the standard KS test from scipy.stats, which produces the conventional KS statistic.
As such, we have left distfit set to RSS. However, we can calculate the standard KS statistic ourselves using the function below, which matches our results above.
```python
def calculate_ks(data, dfit_summary):
    """
    Calculate standard Kolmogorov-Smirnov statistics for fitted
    distributions.

    This function applies the standard scipy.stats.kstest to data using
    the distribution parameters obtained from distfit, providing
    conventional KS statistics rather than distfit's resampling-based
    approach.

    Parameters
    ----------
    data : pandas.Series
        The original data series used for distribution fitting.
    dfit_summary : pandas.DataFrame
        The summary DataFrame from a distfit object, containing fitted
        distribution names and parameters.

    Returns
    -------
    pandas.DataFrame
        A DataFrame containing the distribution names, KS statistics,
        and p-values from the standard KS test.

    Notes
    -----
    Lower KS statistic values indicate better fits to the data.
    """
    results = []
    for idx, row in dfit_summary.iterrows():
        dist_name = row["name"]
        dist_params = row["params"]

        # Perform KS test using scipy.stats.kstest
        ks_result = stats.kstest(data, dist_name, args=dist_params)

        # Store the results
        results.append({
            "name": dist_name,
            "ks": ks_result.statistic,
            "p_value": ks_result.pvalue
        })

    # Create a DataFrame with the results
    return pd.DataFrame(results).sort_values(by="ks")
```
The distfit package has some nice visualisation functions. For example, using the inter-arrival times…
```python
# PDF with all the distributions overlaid
dfit_iat.plot(n_top=11, figsize=(7, 4))

# CDF with all the distributions overlaid
dfit_iat.plot(chart="cdf", n_top=11, figsize=(7, 4))

# QQ plot with all distributions overlaid
dfit_iat.qqplot(data["iat_mins"].dropna(), n_top=11, figsize=(7, 4))

# Summary plot using the RSS
dfit_iat.plot_summary(figsize=(7, 4))
```
We can also use it to create a plot with a specific distribution overlaid, like in the targeted approach:
```python
# To create a plot with a specific distribution overlaid...
dfit = distfit(distr="expon")
dfit.fit_transform(data["iat_mins"].dropna())
dfit.plot(figsize=(7, 4))
```
It’s helpful to have an idea of the likely distributions before testing a few. You can do this directly with scipy.stats, or use a tool like distfit, which has the benefit of easily running a whole panel of different distributions and plotting the results.
In this case, our manual approach found the exponential distribution to be the best fit. With distfit, a few distributions all had very low scores (pareto, expon, genextreme). Choosing between these…
Exponential - simple, widely applicable, appropriate for this context, and has fewer parameters
Generalised Pareto - useful if data has a heavier tail (not the case here)
Generalised extreme value - more complex, specifically designed for modelling maximum values or extreme events
As such, exponential seems a good choice for inter-arrival and service times.
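Once a distribution is chosen, the fitted parameters feed back into the simulation's sampling step. A minimal sketch of that hand-off (the synthetic "observed" data with a mean of 5 minutes stands in for your real data):

```python
import numpy as np
import scipy.stats as stats

# Sketch: fit an exponential distribution to observed times, then use
# the fitted parameters to sample new times in the simulation.
# The synthetic "observed" data here stands in for real data.
rng = np.random.default_rng(seed=3)
observed = rng.exponential(scale=5, size=1000)

# scipy's fit returns (loc, scale) for the exponential distribution
loc, scale = stats.expon.fit(observed)

# Sample event times for the simulation from the fitted distribution
sampled = stats.expon.rvs(loc=loc, scale=scale, size=1000,
                          random_state=42)

print(f"Fitted scale (mean): {scale:.2f}")
```

Storing the fitted parameters alongside the model (rather than hard-coding them) also makes it easy to re-run the input modelling when new data arrives.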
Robinson, Stewart. 2007. “Chapter 7: Data Collection and Analysis.” In Simulation: The Practice of Model Development and Use, 95–125. John Wiley & Sons.